Since the early 2000s, digital music libraries and streaming services have grown exponentially, and digital music now accounts for 56% of total music industry revenue. Streaming services with huge catalogs have become the primary way most people listen to their favorite music. This growth has inadvertently produced an overabundance of available music, making it difficult for users to find new or existing music to enjoy. Streaming platforms therefore face the challenge of introducing users to songs they will like, often by curating playlists based on a user's listening taste. In other words, streaming services have looked for ways to categorize music that allow for personalized recommendations. One method directly analyzes the raw audio of a given song, scoring it on a variety of metrics. The difficulty of playlist generation lies in correctly clustering songs that are similar. In this project we take a slightly different, simpler approach: classifying songs into one of two genres. Although genre classification is quite different from playlist curation, it shares similar principles with respect to the algorithms applied and the song characteristics used.
We shall examine data compiled by a research group known as The Echo Nest. Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will learn how to clean our data, do some exploratory data visualization, and use feature reduction towards the goal of feeding our data through some simple machine learning algorithms, such as decision trees and logistic regression.
A song is about more than its title, artist, and number of listens. We have a dataset of musical features for each track, such as danceability and acousticness, each on a scale from 0 to 1. The data was compiled and provided by The Echo Nest.
Speechiness, acousticness, instrumentalness, liveness, and valence all describe the content of each song. These variables range from 0 to 1, where larger values indicate a greater amount of the quality present in the song. Speechiness measures how much of a song sounds as though it is spoken. Acousticness measures how much of a song features acoustic instruments. Instrumentalness is a measure of how instrumental a given song is. Liveness measures the likelihood of a song being a live recording, with higher values indicating a higher likelihood. Lastly, valence is a measure of how positive the song sounds overall.
Tempo is measured in beats per minute (BPM) and ranges from 60 to over 200 in this dataset, where 200 BPM indicates an extremely fast song. Genre describes a conventional category that identifies pieces of music as belonging to a shared tradition or set of conventions. Danceability measures how suitable a given song is for dancing, on a scale from 0 to 1. Energy is measured on the same scale and captures how energetic a song is.
# Modules
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score
The data comes in two files in two different formats: CSV and JSON. CSV is a popular file format for tabular data; JSON is another common format in which databases often return the results of a query.
We start by creating two pandas DataFrames out of these files that we can merge, so we have features and labels (X and Y) for the classification later on.
# Read in track metadata with genre labels
tracks = pd.read_csv("datasets/fma-rock-vs-hiphop.csv")
# Read in track metrics with the features
echonest_metrics = pd.read_json("datasets/echonest-metrics.json", precise_float=True)
# Let's inspect the first dataset
tracks.head(10)
The first dataset records a variety of information for each song but does not contain any musical characteristics. Let's see what the second dataset looks like; it holds all the musical features.
# Let's inspect the second dataset
echonest_metrics.head(10)
# Merge the relevant columns of tracks and echonest_metrics
echo_tracks = pd.merge(echonest_metrics, tracks[["track_id", "genre_top"]], on="track_id")
# Inspect the resultant dataframe
echo_tracks.info() # can also use echo_tracks.describe()
We see that the tracks dataset has information on the artist, date of release, and so on, while the echonest dataset holds the audio characteristics. Not all of these variables matter for this project's purpose, so we select only specific columns from the tracks dataset, namely track_id and genre_top, to merge into our final dataset shown above.
#Check for Duplicates
echo_tracks.duplicated().sum()
The function returned 0, meaning there is not a single duplicate row in our dataset.
#Find null values
echo_tracks.isnull().sum()
There are no null values in any of our variables.
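Neither check flagged anything here, but had duplicates or missing values turned up, a minimal, hypothetical cleanup might look like this (the toy dataframe and its column names are illustrative, not from our dataset):

```python
import pandas as pd

# Hypothetical dataframe with one duplicate row and one missing value
df = pd.DataFrame({
    "track_id": [1, 2, 2, 3],
    "tempo": [120.0, 95.0, 95.0, None],
})

df = df.drop_duplicates()  # remove exact duplicate rows
df = df.dropna()           # drop rows with any missing value

print(df.shape)  # (2, 2) — only track_ids 1 and 2 survive
```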
We begin by looking at the univariate distributions of the variables, accompanied by variable summaries to gauge the nature of each audio characteristic's distribution. This is followed by an examination of the pairwise relationships between the continuous variables.
echo_tracks.describe().transpose() #descripitve stats of data
plt.subplot(1, 3, 1)
plt.tight_layout()
echo_tracks.acousticness.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for acousticness')
plt.subplot(1, 3, 2)
echo_tracks.danceability.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for danceability')
plt.subplot(1, 3, 3)
echo_tracks.energy.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for energy')
plt.show()
plt.subplot(1, 3, 1)
echo_tracks.liveness.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for liveness')
plt.subplot(1, 3, 2)
echo_tracks.instrumentalness.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for instrumentalness')
plt.subplot(1, 3, 3)
echo_tracks.speechiness.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for speechiness')
plt.show()
plt.subplot(1, 2, 1)
echo_tracks.tempo.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for tempo')
plt.subplot(1, 2, 2)
echo_tracks.valence.plot.density(color='green', figsize = (15,5))
plt.title('Density plot for valence')
plt.show()
We see the density plots for each of the audio characteristics (variables) of the tracks.
Now looking at the correlation matrix for the data.
We typically want to avoid using variables that have strong correlations with each other, hence avoiding feature redundancy, for a few reasons: redundant features add dimensionality without adding information, they can make some models less stable and harder to interpret, and they increase computation time.
To get a sense of whether there are any strongly correlated features in our data, we will use built-in functions in the pandas package.
# Create a correlation matrix
corr_metrics = echo_tracks.corr()
corr_metrics
sns.set(rc = {'figure.figsize':(15,8)}) # used to control the theme and configurations of the seaborn plot
sns.heatmap(echo_tracks.corr(), annot=True, cmap = "summer"); # heatmap with seaborn
From initial inspection, we can identify a relatively weak positive correlation between danceability and valence. Danceability quantifies how suitable a track is for dancing based on a combination of musical elements, like tempo, rhythm, and beat; songs with higher danceability have stronger and more regular beats. Valence, on the other hand, is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). It does make sense that more positive-sounding music (higher-valence songs) tends to be more danceable, hence the positive correlation (and vice versa for low-valence music).
There are also some relatively weak negative correlations, notably between instrumentalness and speechiness. Given that speechiness measures how much of a song sounds spoken and instrumentalness measures how instrumental a given song is, we can expect this sort of relationship between the variables; we might even have expected a stronger correlation.
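As a programmatic complement to eyeballing the heatmap, we can list the feature pairs whose absolute correlation exceeds a threshold. A sketch on toy data (the 0.5 cutoff is an arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # strongly related to a
    "c": rng.normal(size=200),            # independent noise
})

corr = toy.corr().abs()
# Enumerate each unordered pair once and keep the strongly correlated ones
strong = [(r, c) for r in corr.columns for c in corr.columns
          if r < c and corr.loc[r, c] > 0.5]
print(strong)  # only the (a, b) pair exceeds the threshold
```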
Let's now explore some count plots to examine the categorical variable in our data.
# Value counts
echo_tracks["genre_top"].value_counts()
sns.set(rc = {'figure.figsize':(10,5)}) # used to control the theme and configurations of the seaborn plot
sns.countplot(x="genre_top", data=echo_tracks, hue="genre_top");
Next we investigate how the musical characteristics differ across the two genres. We plot the box plots side by side to allow easier comparison between the genres.
sns.set(rc = {'figure.figsize':(18,12)}) # used to control the theme and configurations of the seaborn plot
plt.subplot(5, 2, 1)
sns.boxplot(x=echo_tracks["genre_top"], y=echo_tracks["acousticness"], width=0.3);
plt.subplot(5, 2, 2)
sns.boxplot( x=echo_tracks["genre_top"], y=echo_tracks["danceability"], width=0.3)
plt.subplot(5, 2, 3)
sns.boxplot(x=echo_tracks["genre_top"], y=echo_tracks["energy"], width=0.3);
plt.subplot(5, 2, 4)
sns.boxplot( x=echo_tracks["genre_top"], y=echo_tracks["instrumentalness"], width=0.3)
plt.subplot(5, 2, 5)
sns.boxplot(x=echo_tracks["genre_top"], y=echo_tracks["liveness"], width=0.3);
plt.subplot(5, 2, 6)
sns.boxplot( x=echo_tracks["genre_top"], y=echo_tracks["speechiness"], width=0.3)
plt.subplot(5, 2, 7)
sns.boxplot(x=echo_tracks["genre_top"], y=echo_tracks["tempo"], width=0.3);
plt.subplot(5, 2, 8)
sns.boxplot( x=echo_tracks["genre_top"], y=echo_tracks["valence"], width=0.3)
plt.tight_layout()
plt.show()
With respect to acousticness, Rock seems to have a higher median, but acousticness also varies more across the majority of Rock songs than across Hip-Hop songs. The two genres appear to have very similar distributions for liveness. Hip-Hop scores higher on speechiness, understandably so. The genres also differ with respect to valence, with Hip-Hop generally scoring higher; similar results hold when comparing the two genres on danceability.
It is useful to simplify our models and use as few features as necessary to achieve the best result. Since we didn't find any particular strong correlations between our features, we can instead use a common approach to reduce the number of features called principal component analysis (PCA).
Principal component analysis (PCA) is a linear dimensionality-reduction technique. Put simply, the idea is to find a small number of directions that capture as much of the variation in the data as possible. PCA rotates the data onto the axes of highest variance, allowing us to determine the relative contribution of each feature to the variance between classes.
PCA starts from standardized data and computes the sample covariance matrix. Standardizing is an important part of PCA, ensuring that it does not overemphasize variables with higher variances. PCA then performs an eigendecomposition of the sample covariance matrix of the data X.
Each principal component represents a percentage of total variation captured from the data. The first principal component is the linear combination of x variables that has maximum variance (among all linear combinations). It accounts for as much variation in the data as possible. All subsequent principal components have this same property – they are linear combinations that account for as much of the remaining variation as possible and they are not correlated with the other principal components.
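The eigendecomposition described above can be checked on a small example: the eigenvalues of the sample covariance matrix of standardized data equal the component variances that scikit-learn's PCA reports. A quick sketch on synthetic data, separate from the project pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the features

# Eigendecomposition of the sample covariance matrix (descending eigenvalues)
cov = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

pca = PCA().fit(X)
print(np.allclose(eigvals, pca.explained_variance_))  # True
```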
We first normalize our data. There are a few ways to do this, but a common one is standardization, such that all features have mean = 0 and standard deviation = 1 (the result is a z-score).
# Define our features
features = echo_tracks.drop(["genre_top","track_id"], axis=1)
# Define our labels
labels = echo_tracks["genre_top"]
# Scale the features and set the values to a new variable
scaler = StandardScaler()
scaled_train_features = scaler.fit_transform(features)
With our preprocessed data we can proceed with PCA to determine by how much we can reduce the dimensionality of our data. We begin with scree-plots and cumulative explained ratio plots to find the number of components to use in further analyses.
A scree plot displays the number of components against the variance explained by each component, sorted in descending order of variance. It helps us assess which components explain a sufficient amount of the variance in our data. Here, an 'elbow' (a steep drop from one data point to the next) in the plot is typically used to decide on an appropriate cutoff.
# Get our explained variance ratios from PCA using all features
pca = PCA()
pca.fit(scaled_train_features)
exp_variance = pca.explained_variance_ratio_
print("% of variance explained: ", pca.explained_variance_ratio_) # Percentage of variance explained by each of the selected components
print("Number of components: ", pca.n_components_) # The estimated number of components
# plot the explained variance using a barplot
fig, ax = plt.subplots()
ax.bar(range(8), exp_variance)
ax.set_xlabel('Principal Component #');
There does not appear to be a clear elbow in this scree plot. Instead, we can look at the cumulative explained variance plot to determine how many components are required to explain, say, about 90% of the variance (cutoffs are somewhat arbitrary here, and usually decided by rules of thumb). Once we determine the appropriate number of components, we can perform PCA with that many components, ideally reducing the dimensionality of our data.
# Calculate the cumulative explained variance
cum_exp_variance = np.cumsum(exp_variance)
# Plot the cumulative explained variance and draw a dashed line at 0.90.
fig, ax = plt.subplots()
ax.plot(range(8), cum_exp_variance)
ax.axhline(y=0.9, linestyle='--')
n_components = 6 # Chosen number of Components to use for Analysis
We have chosen to use 6 components, as the cumulative plot shows they explain roughly 90% of the variance, which we deem sufficient.
# Perform PCA with the chosen number of components and project data onto components
pca = PCA(n_components, random_state=10)
pca.fit(scaled_train_features)
pca_projection = pca.transform(scaled_train_features)
Now we can use the lower dimensional PCA projection of the data for the classification.
We shall use our lower dimensional representation of the data (PCA components) to classify songs into genres. Here, we will be using a simple algorithm known as a decision tree. Decision trees are rule-based classifiers that take in features and follow a 'tree structure' of binary decisions to ultimately classify a data point into one of two or more categories. In addition to being easy to both use and interpret, decision trees allow us to visualize the 'logic flowchart' that the model generates from the training data.
So first we split our dataset into 'train' and 'test' subsets: the 'train' subset is used to fit the model, while the held-out 'test' subset allows us to validate model performance.
# Split our data
train_features, test_features, train_labels, test_labels = train_test_split(pca_projection, labels, random_state=10)
# Train our decision tree
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)
# Predict the labels for the test data
pred_labels_tree = tree.predict(test_features)
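One appeal of decision trees mentioned above is that the learned rules can be printed directly. A self-contained sketch on toy data (the feature values here are made up for illustration, not drawn from our dataset), using `sklearn.tree.export_text` to render the 'logic flowchart':

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy features: [speechiness, energy]
X = [[0.8, 0.6], [0.7, 0.7], [0.1, 0.65], [0.05, 0.6]]
y = ["Hip-Hop", "Hip-Hop", "Rock", "Rock"]

toy_tree = DecisionTreeClassifier(random_state=10).fit(X, y)

# Print the learned binary decision rules as text
print(export_text(toy_tree, feature_names=["speechiness", "energy"]))
```

On this toy data only speechiness cleanly separates the classes, so the tree learns a single threshold on it; a high-speechiness point such as `[0.9, 0.5]` would be routed to the 'Hip-Hop' leaf.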
Although our tree's performance may be decent, it's a bad idea to immediately assume that it's therefore the perfect tool for this job - there's always the possibility of other models that will perform even better.
It's always a worthwhile idea to at least test a few other algorithms and find the one that's best for our data. Sometimes simplest is best, and so we will start by applying logistic regression. Logistic regression makes use of what's called the logistic function to calculate the odds that a given data point belongs to a given class.
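The logistic function itself is simple: it squashes any real-valued score z into a probability between 0 and 1. A minimal illustration:

```python
import numpy as np

def logistic(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0))   # 0.5  — an undecided score sits right on the boundary
print(logistic(4))   # ~0.98 — strongly positive scores approach 1
print(logistic(-4))  # ~0.02 — strongly negative scores approach 0
```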
Once we have both models, we can compare them on a few performance metrics, such as false positive and false negative rates (how many points are inaccurately classified). We shall also assess the other metrics provided by the classification report.
# Train our logistic regression and predict labels for the test set
logreg = LogisticRegression(random_state=10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)
# Create the classification report for both models
class_rep_tree = classification_report(test_labels, pred_labels_tree)
class_rep_log = classification_report(test_labels, pred_labels_logit)
print("Decision Tree: \n", class_rep_tree)
print("Logistic Regression: \n", class_rep_log)
Both our models do similarly well, each boasting an average precision of 87%. However, the classification report shows that rock songs are classified fairly well, while hip-hop songs are disproportionately misclassified as rock songs.
This is to be expected given the number of data points in each class: we have far more rock tracks than hip-hop tracks, which can skew the model's ability to distinguish between the classes. It also tells us that most of each model's accuracy is driven by its ability to classify just rock songs, which is less than ideal.
To account for this, we can weight a correct classification in each class inversely to how often that class occurs. Since a correct classification for "Rock" is no more important than one for "Hip-Hop" (and vice versa), we only need to account for the difference in sample sizes, not for any difference in the relative importance of the classes. Below we achieve this by balancing the dataset itself, downsampling the rock songs to match the number of hip-hop songs.
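Here we balance by resampling the data itself. As an aside, scikit-learn can also apply inverse-frequency weighting directly via the `class_weight="balanced"` option, without discarding any rows. A sketch on imbalanced toy data (the class sizes and feature values are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
# Imbalanced toy data: 90 'Rock' points vs 10 'Hip-Hop' points
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array(["Rock"] * 90 + ["Hip-Hop"] * 10)

# Misclassifying a minority point now costs 9x a majority point
weighted = LogisticRegression(class_weight="balanced", random_state=10)
weighted.fit(X, y)
print(weighted.score(X, y))
```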
# Subset only the hip-hop tracks, and then only the rock tracks
hop_only = echo_tracks.loc[echo_tracks["genre_top"] == "Hip-Hop"]
# sample the rocks songs to be the same number as there are hip-hop songs
rock_only = echo_tracks.loc[echo_tracks["genre_top"] == "Rock"].sample(len(hop_only), random_state=10)
# concatenate the dataframes rock_only and hop_only
rock_hop_bal = pd.concat([rock_only, hop_only])
# The features, labels, and pca projection are created for the balanced dataframe
features = rock_hop_bal.drop(['genre_top', 'track_id'], axis=1)
labels = rock_hop_bal['genre_top']
pca_projection = pca.fit_transform(scaler.fit_transform(features))
# Redefine the train and test set with the pca_projection from the balanced data
train_features, test_features, train_labels, test_labels = train_test_split(pca_projection, labels, random_state=10)
Now we can examine whether balancing our dataset reduces model bias. In balancing the data, however, we removed many data points that might have been crucial for training. Let's test whether balancing improves the bias towards the "Rock" classification while retaining overall classification performance.
Note that we have already reduced the size of our dataset, and we keep the same six-component PCA projection. In practice, we would weigh dimensionality reduction more rigorously when dealing with very large datasets, where computation times become prohibitively long.
# Train our decision tree on the balanced data
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)
pred_labels_tree = tree.predict(test_features)
# Train our logistic regression on the balanced data
logreg = LogisticRegression(random_state=10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)
# Compare the models
print("Decision Tree: \n", classification_report(test_labels, pred_labels_tree))
print("Logistic Regression: \n", classification_report(test_labels, pred_labels_logit))
Success: balancing our data has removed the bias towards the more prevalent class. To get a better sense of how well our models actually perform, we can apply what's called cross-validation (CV), which lets us compare models in a more rigorous fashion. We do this next.
Since the way our data is split into train and test sets can impact model performance, CV attempts to split the data multiple ways and test the model on each of the splits. Although there are many different CV methods, all with their own advantages and disadvantages, we will use what's known as K-fold CV here.
K-fold first splits the data into K different, equally sized subsets. Then, it iteratively uses each subset as a test set while using the remainder of the data as train sets. Finally, we can then aggregate the results from each fold for a final model performance score.
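The mechanics are easy to see directly: with K = 5 on 100 points, each fold's test set holds 20 points, and every point lands in a test set exactly once. A quick sketch:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)
kf = KFold(n_splits=5)

test_sizes = []
seen = []
for train_idx, test_idx in kf.split(X):
    test_sizes.append(len(test_idx))
    seen.extend(test_idx)

print(test_sizes)             # [20, 20, 20, 20, 20]
print(len(set(seen)) == 100)  # True: each point is tested exactly once
```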
# Set up our K-fold cross-validation
kf = KFold(n_splits=10)
tree = DecisionTreeClassifier(random_state=10)
logreg = LogisticRegression(random_state=10)
# Train our models using KFold cv
tree_score = cross_val_score(tree, pca_projection, labels, cv=kf)
logit_score = cross_val_score(logreg, pca_projection, labels, cv=kf)
# Print the mean of each array of scores
print("Decision Tree:", np.mean(tree_score), "Logistic Regression:", np.mean(logit_score))
We see that the "simpler" classical method, logistic regression, actually performed better than the decision tree.